import pandas as pd
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)

# Load dataset
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")
QUESTION 1 Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.
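One possible way to approach this step is sketched below. It is not the code from this report: the missing-value markers (`""`, `"n/a"`, `-999`), the helper `standardize_missing`, and the sample rows are all assumptions made for illustration, so the markers should be checked against the real data before reuse.

```python
import json

import numpy as np
import pandas as pd


def standardize_missing(frame, markers=("", "n/a", -999)):
    """Return a copy of `frame` with every assumed missing marker replaced by NaN.

    The default markers are hypothetical; inspect the real data to confirm them.
    """
    return frame.replace(list(markers), np.nan)


# Made-up sample rows that imitate the kinds of gaps the question describes
sample = pd.DataFrame({
    "airport_code": ["ATL", "DEN"],
    "airport_name": ["", "Denver airport (placeholder)"],
    "month": ["January", "n/a"],
    "num_of_delays_late_aircraft": [-999, 928],
})

clean = standardize_missing(sample)

# json.dumps writes float NaN as the literal NaN, whereas DataFrame.to_json writes null
record = clean.to_dict(orient="records")[0]
record_json = json.dumps(record)
print(record_json)
```

Using `json.dumps` on a single record is a deliberate choice here, because the question asks for “NaN” to appear in the raw JSON and pandas' own `to_json` would emit `null` instead.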
QUESTION 2 Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
According to the attributes provided, the Atlanta, Georgia airport has the worst delays. It also handles a large number of total flights.
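The summary table the question asks for could be built roughly as follows. This is a sketch under assumptions rather than the report's actual code: the column names `airport_code` and `minutes_delayed_total` are assumed, the numbers are made up, and "proportion of delayed flights" is only one reasonable choice of metric.

```python
import pandas as pd

# Made-up rows; column names other than num_of_flights_total and
# num_of_delays_total are assumptions about the dataset
flights = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SLC", "SLC"],
    "num_of_flights_total": [1000, 2000, 500, 500],
    "num_of_delays_total": [250, 350, 50, 100],
    "minutes_delayed_total": [15000, 21000, 3000, 6000],
})

# One row per airport with the totals the question lists
summary = (
    flights.groupby("airport_code")
    .agg(
        total_flights=("num_of_flights_total", "sum"),
        total_delays=("num_of_delays_total", "sum"),
        total_delay_minutes=("minutes_delayed_total", "sum"),
    )
    .reset_index()
)

# Share of flights that were delayed, and mean delay per delayed flight in hours
summary["proportion_delayed"] = summary["total_delays"] / summary["total_flights"]
summary["avg_delay_hours"] = summary["total_delay_minutes"] / summary["total_delays"] / 60

# Rank airports from worst to best by the chosen metric
summary = summary.sort_values("proportion_delayed", ascending=False).reset_index(drop=True)
print(summary)
```

Ranking by proportion rather than by raw delay counts avoids penalizing an airport simply for handling more flights.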
QUESTION 3 What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month.
To me, it looks like November is the best month to fly. It has fewer than 10,000 delays.
ggplot(df.dropna(subset=['month']), aes(x='month', y='num_of_delays_total')) +\
    geom_point() +\
    labs(title='Best Month to Fly', x='Month', y='Number of Delays')
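Because the question asks for the x-axis to be ordered by month, one option is to aggregate by month first and store the month as an ordered categorical. The sketch below uses made-up numbers, so its result says nothing about the real data; the proportion-based metric and the names `MONTH_ORDER`, `by_month`, and `best_month` are assumptions introduced for illustration.

```python
import pandas as pd

MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Made-up rows, including one record whose month is missing
delays = pd.DataFrame({
    "month": ["November", "January", "November", "January", None],
    "num_of_flights_total": [1000, 1000, 1000, 1000, 500],
    "num_of_delays_total": [150, 250, 170, 230, 100],
})

# Drop records without a month, then impose calendar order on the column
monthly = delays.dropna(subset=["month"]).copy()
monthly["month"] = pd.Categorical(monthly["month"], categories=MONTH_ORDER, ordered=True)

# Groups come back in calendar order because the key is an ordered categorical
by_month = (
    monthly.groupby("month", observed=True)
    .agg(
        total_flights=("num_of_flights_total", "sum"),
        total_delays=("num_of_delays_total", "sum"),
    )
    .reset_index()
)
by_month["proportion_delayed"] = by_month["total_delays"] / by_month["total_flights"]

# Month with the lowest share of delayed flights in this sample
best_month = by_month.loc[by_month["proportion_delayed"].idxmin(), "month"]
print(by_month)
print(best_month)
```

A table like `by_month` has one row per month, which makes it easier to compare months than plotting every airport-month record as a separate point.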
QUESTION 4 According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table.
Use these three rules for your calculations:
100% of delayed flights in the Weather category are due to weather
30% of all delayed flights in the Late-Arriving category are due to weather
From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
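The three rules can be sketched as a single column calculation. The report does not show the code that creates `total_weather_delays`, so the version below is an assumption: the input column names (`num_of_delays_weather`, `num_of_delays_late_aircraft`, `num_of_delays_nas`) are assumed, the rows are made up, and missing Late Aircraft values are taken to be NaN already.

```python
import numpy as np
import pandas as pd

# Made-up rows; the three num_of_delays_* column names are assumptions
weather = pd.DataFrame({
    "month": ["January", "June", "August", "December"],
    "num_of_delays_weather": [100, 50, 40, 80],
    "num_of_delays_late_aircraft": [1000.0, np.nan, 2000.0, 3000.0],
    "num_of_delays_nas": [200, 300, 100, 400],
})

# Replace missing Late Aircraft values with the mean of the observed values
late = weather["num_of_delays_late_aircraft"]
weather["num_of_delays_late_aircraft"] = late.fillna(late.mean())

# NAS share due to weather: 40% from April to August, 65% in all other months
april_to_august = ["April", "May", "June", "July", "August"]
nas_share = np.where(weather["month"].isin(april_to_august), 0.40, 0.65)

# 100% of Weather + 30% of Late Aircraft + the month-dependent share of NAS
weather["total_weather_delays"] = (
    weather["num_of_delays_weather"]
    + 0.30 * weather["num_of_delays_late_aircraft"]
    + nas_share * weather["num_of_delays_nas"]
)

print(weather.head(5))
```

In this sample the June row works out to 50 + 0.30 × 2000 + 0.40 × 300 = 770, which is a quick way to check the rules by hand.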
QUESTION 5 Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
# Total flights and total weather delays for each airport
weather_summary = df.groupby('airport_name').agg(
    total_flights=('num_of_flights_total', 'sum'),
    total_weather_delays=('total_weather_delays', 'sum')
).reset_index()

# Proportion of all flights delayed by weather
weather_summary['proportion_delayed_by_weather'] = weather_summary['total_weather_delays'] / weather_summary['total_flights']

# Horizontal barplot; stat='identity' draws the precomputed proportions as bar lengths
ggplot(weather_summary, aes(x='airport_name', y='proportion_delayed_by_weather')) +\
    geom_bar(stat='identity', fill='skyblue') +\
    coord_flip() +\
    labs(title='Proportion of Flights Delayed by Weather by Airport', x='Airport', y='Proportion Delayed')